Table of Content

1 Preperation

2 Exploratory Analysis

3 Final Plots and Summary

4 Reflection

5 References

Appendix


1 Preperation

# Download csv-file to working directory:
URL <- "https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv"
download.file(URL, destfile = "./wine.csv", method="curl")

# Open data:
data = read.csv("wine.csv")

I begin with checking for missing values:

anyNA(data)
## [1] FALSE

There are no missing values in the dataset. I can directly start with exploring the data:

2 Exploratory Analysis

Note: A detailed description of the variables can be found in the appendix.

2.1 Univariate Plots and Analysis

dim(data)
## [1] 1599   13

The dataset consists of 1599 cases with 13 variables per case (1 index, 11 independent, and 1 dependent variables). Every case (i.e. every row in the dataset) represents one particular red wine which was tasted and evaluated by wine experts. The result of this evaluation is stored in the variable quality.

As usual I start the exploration with printing out the first three rows of the dataset:

head(data,3)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
##   quality
## 1       5
## 2       5
## 3       5

I continue with the so called Five Number Summary of the data, which shows the spread (minimium and maxiumun), the mean and the 2- and 3-quantiles:

summary(data)
##        X        fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1   Min.   : 4.60   Min.   :0.120    Min.   :0.000  
##  1st Qu.: 400   1st Qu.: 7.10   1st Qu.:0.390    1st Qu.:0.090  
##  Median : 800   Median : 7.90   Median :0.520    Median :0.260  
##  Mean   : 800   Mean   : 8.32   Mean   :0.528    Mean   :0.271  
##  3rd Qu.:1200   3rd Qu.: 9.20   3rd Qu.:0.640    3rd Qu.:0.420  
##  Max.   :1599   Max.   :15.90   Max.   :1.580    Max.   :1.000  
##  residual.sugar    chlorides      free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.90   Min.   :0.0120   Min.   : 1.0        Min.   :  6.0       
##  1st Qu.: 1.90   1st Qu.:0.0700   1st Qu.: 7.0        1st Qu.: 22.0       
##  Median : 2.20   Median :0.0790   Median :14.0        Median : 38.0       
##  Mean   : 2.54   Mean   :0.0875   Mean   :15.9        Mean   : 46.5       
##  3rd Qu.: 2.60   3rd Qu.:0.0900   3rd Qu.:21.0        3rd Qu.: 62.0       
##  Max.   :15.50   Max.   :0.6110   Max.   :72.0        Max.   :289.0       
##     density            pH         sulphates        alcohol    
##  Min.   :0.990   Min.   :2.74   Min.   :0.330   Min.   : 8.4  
##  1st Qu.:0.996   1st Qu.:3.21   1st Qu.:0.550   1st Qu.: 9.5  
##  Median :0.997   Median :3.31   Median :0.620   Median :10.2  
##  Mean   :0.997   Mean   :3.31   Mean   :0.658   Mean   :10.4  
##  3rd Qu.:0.998   3rd Qu.:3.40   3rd Qu.:0.730   3rd Qu.:11.1  
##  Max.   :1.004   Max.   :4.01   Max.   :2.000   Max.   :14.9  
##     quality    
##  Min.   :3.00  
##  1st Qu.:5.00  
##  Median :6.00  
##  Mean   :5.64  
##  3rd Qu.:6.00  
##  Max.   :8.00

Beside location parameters it is also useful to investigate the dispersion of the data. I thereforse continue with the standard deviation:

sapply(data[2:13], sd)
##        fixed.acidity     volatile.acidity          citric.acid 
##             1.741096             0.179060             0.194801 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             1.409928             0.047065            10.460157 
## total.sulfur.dioxide              density                   pH 
##            32.895324             0.001887             0.154386 
##            sulphates              alcohol              quality 
##             0.169507             1.065668             0.807569

Most statistical tests require normally distributed variables.

The following graphs gives us more insights about the distribution of the variables:

par(mfrow=c(3, 5))
colnames <- dimnames(data)[[2]]
for (i in 2:13) {
    hist(data[,i], main=colnames[i], col="gray", border="white")
}

plot of chunk unnamed-chunk-7

As we can see, only a few variables seem to be normally distributed (e.g.: density and pH), moreover most variables are right-skewed, and some variables have only a few states (e.g. chlorides).

The most important variable is quality. For a better evaluation wether this variable is normally distributed or not I print out a larger histogram:

#hist(data$quality, col="gray", labels = TRUE)
library(ggplot2)
ggplot(data, aes(x=quality)) + geom_histogram() + scale_x_discrete() + 
    ggtitle("Histogram of wine quality")

plot of chunk unnamed-chunk-8

From the histogram, we can see that most wines were evaluated with 5 or 6. Contrarily, very high and very low quality wines are not present. The following frequency table shows this more precisely:

library(descr)
freq(data$quality, plot = FALSE)
## data$quality 
##       Frequency  Percent
## 3            10   0.6254
## 4            53   3.3146
## 5           681  42.5891
## 6           638  39.8999
## 7           199  12.4453
## 8            18   1.1257
## Total      1599 100.0000

More than 80% of all wines are evaluated with a 5 or 6 on a 1-to-10-scale.

At this point, I am not sure wether I should assume the variable quality normally distributed or not. The Shapiro-Wilk-Test and the Anderson-Darling-Test which test against the assumption of normality however show a p-value smaller than 0.1:

# Shapiro-Wilk-Test:
shapiro.test(data$quality)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$quality
## W = 0.8576, p-value < 2.2e-16
# Anderson-Darling-Test:
library(nortest)
ad.test(data$quality)
## 
##  Anderson-Darling normality test
## 
## data:  data$quality
## A = 110.6, p-value < 2.2e-16

According to these tests, the null hypothetis that the sample came from a normally distributed population should be rejected. On the other side, the large sample size of n=1599 together with the reasonably accurate bell curve shape indicates that we’re should be on the safe side with assuming quality as normally distributed.

2.2 Bivariate Plots and Analysis

However, it could make sense to summarize the variable. The red vertical line shows the average quality of all wines. I suggest to split the wines in two groups: wines below this line (quality scale of 5 and lower) and wines above this lines (quality scale of 6 and higher).

#hist(data$quality, col="gray", labels = TRUE)
#abline(v=mean(data$quality), col="red", lwd=4)
ggplot(data, aes(x=quality)) + geom_histogram() + scale_x_discrete() + 
    geom_vline(data=data, aes(xintercept=mean(data$quality), colour="red") ) + 
    ggtitle("Histogram of wine quality")

plot of chunk unnamed-chunk-11

The following code generates a new factor variable qual_cat with the values low quality and high quality:

data$qual_cat <- rep(NA, nrow(data)) #create new column and fill with NA
data[data$quality <= 5, ][, "qual_cat"] <- "low Quality"
data[data$quality >= 6, ][, "qual_cat"] <- "high Quality"

table(data$qual_cat)
## 
## high Quality  low Quality 
##          855          744

It could be interesting to explore the independent variables referrring to the new created category qual_cat:

colnames <- dimnames(data)[[2]]
for (i in 2:12) {
    boxplot(data[,i]~data$qual_cat, main=colnames[i])
}

plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13plot of chunk unnamed-chunk-13

These boxplots show us the maximum and minimium, the 25%, 50% (median) and 75% quantile and the outliers.

We can see that esp. the variables volatile.acidity (= “amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”), citric.acid,volatile acidity, and alcohol seems to have an influence on the wine quality as the median of both boxplots strongly differs. Some variables also contain many outliers and in some cases the occurance of outliers depends on the wine quality. For example: high quality wines have on average more alcohol, but some wines with high alcohol concentration are low quality wines (outliers). Similar findings for total.sulfur.dioxide: low quality wines have on average more total.sulfur.dioxide, but some wines a high concentraion of total.sulfur.dioxide are high quality wines

To summarize these boxplots: _high quality red wines have less volatile acidity, more citric acid, and more alcohol.

I continue with checking how the independent variables might be influenced by each other. The following matrix shows the spearman correlation coefficient between all independent variable pairs:

library("psych")
corr_result <- corr.test(data[2:11])
corr_result$r
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity              1.00000        -0.256131     0.67170
## volatile.acidity          -0.25613         1.000000    -0.55250
## citric.acid                0.67170        -0.552496     1.00000
## residual.sugar             0.11478         0.001918     0.14358
## chlorides                  0.09371         0.061298     0.20382
## free.sulfur.dioxide       -0.15379        -0.010504    -0.06098
## total.sulfur.dioxide      -0.11318         0.076470     0.03553
## density                    0.66805         0.022026     0.36495
## pH                        -0.68298         0.234937    -0.54190
## sulphates                  0.18301        -0.260987     0.31277
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity              0.114777  0.093705           -0.153794
## volatile.acidity           0.001918  0.061298           -0.010504
## citric.acid                0.143577  0.203823           -0.060978
## residual.sugar             1.000000  0.055610            0.187049
## chlorides                  0.055610  1.000000            0.005562
## free.sulfur.dioxide        0.187049  0.005562            1.000000
## total.sulfur.dioxide       0.203028  0.047400            0.667666
## density                    0.355283  0.200632           -0.021946
## pH                        -0.085652 -0.265026            0.070377
## sulphates                  0.005527  0.371260            0.051658
##                      total.sulfur.dioxide  density       pH sulphates
## fixed.acidity                    -0.11318  0.66805 -0.68298  0.183006
## volatile.acidity                  0.07647  0.02203  0.23494 -0.260987
## citric.acid                       0.03553  0.36495 -0.54190  0.312770
## residual.sugar                    0.20303  0.35528 -0.08565  0.005527
## chlorides                         0.04740  0.20063 -0.26503  0.371260
## free.sulfur.dioxide               0.66767 -0.02195  0.07038  0.051658
## total.sulfur.dioxide              1.00000  0.07127 -0.06649  0.042947
## density                           0.07127  1.00000 -0.34170  0.148506
## pH                               -0.06649 -0.34170  1.00000 -0.196648
## sulphates                         0.04295  0.14851 -0.19665  1.000000
corr_result$p
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity            0.000e+00         0.000000   0.000e+00
## volatile.acidity         0.000e+00         0.000000   0.000e+00
## citric.acid              0.000e+00         0.000000   0.000e+00
## residual.sugar           4.199e-06         0.938917   8.084e-09
## chlorides                1.752e-04         0.014225   2.220e-16
## free.sulfur.dioxide      6.336e-10         0.674701   1.474e-02
## total.sulfur.dioxide     5.709e-06         0.002214   1.555e-01
## density                  0.000e+00         0.378755   0.000e+00
## pH                       0.000e+00         0.000000   0.000e+00
## sulphates                1.648e-13         0.000000   0.000e+00
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity             8.819e-05 3.328e-03           1.521e-08
## volatile.acidity          1.000e+00 1.849e-01           1.000e+00
## citric.acid               1.778e-07 6.661e-15           1.849e-01
## residual.sugar            0.000e+00 2.879e-01           1.218e-12
## chlorides                 2.617e-02 0.000e+00           1.000e+00
## free.sulfur.dioxide       4.685e-14 8.241e-01           0.000e+00
## total.sulfur.dioxide      2.220e-16 5.809e-02           0.000e+00
## density                   0.000e+00 4.441e-16           3.805e-01
## pH                        6.066e-04 0.000e+00           4.870e-03
## sulphates                 8.252e-01 0.000e+00           3.888e-02
##                      total.sulfur.dioxide   density        pH sulphates
## fixed.acidity                   1.142e-04 0.000e+00 0.000e+00 4.119e-12
## volatile.acidity                3.764e-02 1.000e+00 0.000e+00 0.000e+00
## citric.acid                     1.000e+00 0.000e+00 0.000e+00 0.000e+00
## residual.sugar                  6.661e-15 0.000e+00 1.092e-02 1.000e+00
## chlorides                       5.228e-01 1.243e-14 0.000e+00 0.000e+00
## free.sulfur.dioxide             0.000e+00 1.000e+00 7.305e-02 3.888e-01
## total.sulfur.dioxide            0.000e+00 6.967e-02 1.095e-01 6.881e-01
## density                         4.354e-03 0.000e+00 0.000e+00 5.562e-08
## pH                              7.818e-03 0.000e+00 0.000e+00 5.995e-14
## sulphates                       8.602e-02 2.418e-09 2.220e-15 0.000e+00

The first table shows the correlation coefficients and the second table the significance values. We can see that some variables are correlated with each other.

As there does not exist an objective guideline for the interpretation of a correlation coefficient,[2] I suggest to consider correlations of more than 0.5 as high. I.e. variable pairs with correlation coefficients of more than 0.5 or less than -0.5. These are:

  • citric.acid <-> fixed.acidity
  • citric.acid <-> volatile.acidity
  • density <-> fixed.acidity
  • citric.acid <-> ph
  • fixed.acidity <-> ph
  • total.sulfur.dioxide <-> free.sulfur.dioxide

Let’s now investigate these variables in more detail:

library(car)
scatterplotMatrix(~citric.acid + fixed.acidity 
                   + volatile.acidity
                   + density + pH
                   + total.sulfur.dioxide + free.sulfur.dioxide, data=data)

plot of chunk unnamed-chunk-15

In this matrix scatterplot, the diagonal cells show histograms of each of the variables. Each of the off-diagonal cells is a scatterplot of two of the 7 variables. For example: the second cell in the first row in the second matrix is a scatterplot of citric.acid against fixed.acidity.

We see some variable pairs with positive or negative relationship. I summarized these in the following table:

Variable Pair Visual Relationship
citric.acid - fixed.acidity positive
density - fixed.acidity positive
ph - fixed.acidity negative
ph - citric.acidity negative
total.sulfur.dioxide - free.sulfur.dioxide positive

We’ll investigate these variable pairs in more detail in the next section.

2.3 Multivariate Plots and Analysis

Are variables depending on the index?

So far, we haven’t examined wether the data is independent of the index of the data. E.g., it could be possible that wine experts tend to evaluate different from each other. And if the order of the data is not independent of the experts (e.g. first 25 wines were evaluated by expert A, second 25 wines from expert B, etc.) we should see a pattern in the data.

The following code - adopted from Avril Coghlan [4] - prints out the 13 variables and their indices:

makeProfilePlot <- function(mylist,names)
{
    require(RColorBrewer)
    # find out how many variables we want to include
    numvariables <- length(mylist)
    # choose 'numvariables' random colours
    colours <- brewer.pal(numvariables,"Set1")
    # find out the minimum and maximum values of the variables:
    mymin <- 1e+20
    mymax <- 1e-20
    for (i in 1:numvariables)
    {
        vectori <- mylist[[i]]
        mini <- min(vectori)
        maxi <- max(vectori)
        if (mini < mymin) { mymin <- mini }
        if (maxi > mymax) { mymax <- maxi }
    }
    # plot the variables
    for (i in 1:numvariables)
    {
        vectori <- mylist[[i]]
        namei <- names[i]
        colouri <- colours[i]
        if (i == 1) { plot(vectori,col=colouri,type="l",ylim=c(mymin,mymax)) }
        else         { points(vectori, col=colouri,type="l")                                     }
        lastxval <- length(vectori)
        lastyval <- vectori[length(vectori)]
        text((lastxval-200),(lastyval),namei,col="black",cex=0.9)
    }
}

# Plot first 4 variables
library(RColorBrewer)
names <- colnames(data[2:5])
#mylist <- list(data[1],data[2], data[3])
#mylist <- list(data$fixed.acidity,data$volatile.acidity,data$citric.acid,data$residual.sugar, data$chlorides)
mylist <- data[2:5]
makeProfilePlot(mylist,names)

plot of chunk unnamed-chunk-16

# Plot next 4 variables
library(RColorBrewer)
names <- colnames(data[6:9])
#mylist <- list(data[1],data[2], data[3])
#mylist <- list(data$fixed.acidity,data$volatile.acidity,data$citric.acid,data$residual.sugar, data$chlorides)
mylist <- data[6:9]
makeProfilePlot(mylist,names)

plot of chunk unnamed-chunk-16

# Plot next 4 variables
library(RColorBrewer)
names <- colnames(data[10:13])
#mylist <- list(data[1],data[2], data[3])
#mylist <- list(data$fixed.acidity,data$volatile.acidity,data$citric.acid,data$residual.sugar, data$chlorides)
mylist <- data[10:13]
makeProfilePlot(mylist,names)

plot of chunk unnamed-chunk-16

We see from all three graphs that all variables are independent of the index. So no cause for concern here. But we can also see, that the variable total.sulfur.dioxide contains two heavy outliers which seem to lie next to each other.

Are these two outliers are not emerged by chance?

Let’s print out the exact indices of both outliers:

which(data$total.sulfur.dioxide>200)
## [1] 1080 1082

Unlike expected, both outliers don’t lie next to each other - so no cause for concern here. We do not have to correct this.

Are correlations between variables caused by wine quality?

We saw from scatterplot matrix in the previous section that some variables seems to have positive or negative relationships. Let’s investigate how these variable-pair-relationships differ between low and high quality wines:

data$qual_cat <- factor(data$qual_cat)
with (data, plot(citric.acid,fixed.acidity, col=qual_cat, pch=19))

plot of chunk unnamed-chunk-18

with (data, plot(density,fixed.acidity, col=qual_cat, pch=19))

plot of chunk unnamed-chunk-18

with (data, plot(pH, citric.acid, col=qual_cat, pch=19))

plot of chunk unnamed-chunk-18

with (data, plot(total.sulfur.dioxide, free.sulfur.dioxide, col=qual_cat, pch=19))

plot of chunk unnamed-chunk-18

These plots could answer the question whether the correlations are influenced by wine quality or not (this would be the case when for example most black points are on the left and most red points on the right). Unfortunately, we do not see any strong patterns - the correlations between fixed.acidity and density, between citric.acid and pH, and between free.sulfur.dioxide and total.sulfur.dioxide are not caused by wine quality.

Citric.acid is correlated with three other variables. I suggest to keep citric.acid and total.sulfur.dioxide as the latter variable produced a higher R2:

fit1 <- lm(quality ~ total.sulfur.dioxide, data=data)
fit2 <- lm(quality ~ free.sulfur.dioxide, data=data)
sprintf("R2 (total.sulfur.dioxide -> quality): %s", summary(fit1)$r.squared)
## [1] "R2 (total.sulfur.dioxide -> quality): 0.0342621169606877"
sprintf("R2 (free.sulfur.dioxide -> quality): %s", summary(fit2)$r.squared)
## [1] "R2 (free.sulfur.dioxide -> quality): 0.00256603613553544"

Note: Here I used a simple linar regression model to evaluate the influence of total.sulfur.dioxide and free.sulfur.dioxide on wine quality. Although it is not appropriate to apply a linear regression model (because of the ordinal character of the dependant variable quality), I decided to use this kind of statistical method as this a quick and simple way to evaluate which variable has more influence.

The overall analysis in which I answer the question which charecteristics have an influence on the wine quality will be carried out with a logistic regression:

Logistic Regression

Logistic regressions (logits) are used to model dichotomous outcome variables. Typical scenarios for logit-models are linear regressions in which the dependant variable is not metric. In the logit model the log odds of the outcome is modeled as a linear combination of the predictor variables. Log odds are an alternate way of expressing probabilities, log odds are the log of the odds ratio.[3]

I start with converting the newly build factor variable qual_cat into a dichotomous variable:

# Convert qual_cat in binary variable (0="low quality""; 1="high quality"):
y <- ifelse(data$qual_cat=="high Quality", 1, 0)
table(y)
## y
##   0   1 
## 744 855

Wines of high (low) quality are now labeled with 1 (0).

Next I call the logit model:

mylogit <- glm( y ~ volatile.acidity + citric.acid +
            residual.sugar + chlorides + total.sulfur.dioxide +
            sulphates + alcohol, data = data, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = y ~ volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + total.sulfur.dioxide + sulphates + alcohol, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.135  -0.855   0.320   0.859   2.301  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -7.99561    0.81848   -9.77  < 2e-16 ***
## volatile.acidity     -3.33332    0.45466   -7.33  2.3e-13 ***
## citric.acid          -0.60136    0.40093   -1.50   0.1336    
## residual.sugar        0.04891    0.04470    1.09   0.2739    
## chlorides            -4.14268    1.47283   -2.81   0.0049 ** 
## total.sulfur.dioxide -0.01247    0.00202   -6.17  6.6e-10 ***
## sulphates             2.80513    0.43809    6.40  1.5e-10 ***
## alcohol               0.87606    0.07135   12.28  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2209.0  on 1598  degrees of freedom
## Residual deviance: 1670.7  on 1591  degrees of freedom
## AIC: 1687
## 
## Number of Fisher Scoring iterations: 4

The matrix shows the logistic regression coefficients. These are the change in the log odds of the outcome variable qual_cat for a one unit increase in the predictor variable. The exponential function of this gives us the odd ratios which are easier to interpret:

exp(cbind(OR = coef(mylogit), confint(mylogit)))
##                             OR     2.5 %    97.5 %
## (Intercept)          3.369e-04 0.0000664  0.001646
## volatile.acidity     3.567e-02 0.0144299  0.085862
## citric.acid          5.481e-01 0.2489936  1.200110
## residual.sugar       1.050e+00 0.9610567  1.146004
## chlorides            1.588e-02 0.0008202  0.270856
## total.sulfur.dioxide 9.876e-01 0.9836514  0.991481
## sulphates            1.653e+01 7.1438381 39.802301
## alcohol              2.401e+00 2.0930772  2.768992

The above table shows the odd ratios together with the confidence interval. The odd ratios give the change in the log odds of the outcome variable for a one unit increase in the predictor variable. E.g.: for every one unit change in alcohol, the log odds of high quality (versus low quality) increases by 0.87606, i.e.: for every one-unit increase in alcohol the probability that this wine will be evaluated as high quality increases by 2.4% (`exp(0.87606)).

Finally, let’s check how well the model fit:

chi <- with(mylogit, null.deviance - deviance)
df <- with(mylogit, df.null - df.residual)
p_value <- with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, 
                                lower.tail = FALSE))
sprintf("Test Statistic (chi2): %s", chi)
## [1] "Test Statistic (chi2): 538.304915401857"
sprintf("Degrees of Freedom: %s", df)
## [1] "Degrees of Freedom: 7"
sprintf("P-value: %s", p_value)
## [1] "P-value: 4.63479040986253e-112"

The chi-square of 538.3049 with 7 degrees of freedom and an associated p-value of less than 0.001 tells us that our model as a whole fits significantly better than an empty model.

3 Final Plots and Summary

According to the low range of the variable quality, I decided throughout the analysis to split the dataset in two groups: wines of low quality (marked red) and wines of high quality (marked as blue).

After this, we can easily see how both groups differ:

#Create means of variables for each quality category:
features <- c("volatile.acidity", "citric.acid", "residual.sugar", 
              "chlorides", "sulphates", "total.sulfur.dioxide", "alcohol")
features_translated <- c("acetic acid", "citric acid", "residual sugar",
                     "sodium chloride", "sulphates", 
                     "total sulfur dioxide", "alcohol")
mean_lowQual<-apply(data[data$qual_cat=="low Quality",][,features],2,mean)
mean_highQual<-apply(data[data$qual_cat=="high Quality",][,features],2,mean)

#Merge both rows:
means <- rbind(mean_lowQual,mean_highQual)
means
##               volatile.acidity citric.acid residual.sugar chlorides
## mean_lowQual            0.5895      0.2378          2.542   0.09299
## mean_highQual           0.4741      0.2999          2.536   0.08266
##               sulphates total.sulfur.dioxide alcohol
## mean_lowQual     0.6185                54.65   9.926
## mean_highQual    0.6926                39.35  10.855
# Create barplot:
layout( matrix(c(1,2,3),1,3), widths=c(3,1), heights=c(1,2))

barplot(means[,1:5],beside=T,
        col = c("red", "blue"), ylab = "g / dm^3",
        names = features_translated[1:5], las=2)

legend("topright", legend = c("low quality wines","high quality wines"), 
       pch=15, col = c("red", "blue") )

barplot(means[,6], width=1,beside=T,
        col = c("red", "blue"), ylab = "mg / dm^3",
        names = features_translated[6], las=2, space=0)

barplot(means[,7], width=1,beside=T,
        col = c("red", "blue"), ylab = "% by volume",
        names = features_translated[7],las=2, space=0)

plot of chunk unnamed-chunk-24

Sugar for example does not have an influence on wine quality. Alcohol, acid and sulphates on the other side does have an influence on wine quality: high quality wines contain more alcohol, citric acid and sulphates on average.

However, the dataset contains some outliers:

library(ggplot2)
ggplot(data=data, aes(x=quality, y=alcohol, fill=qual_cat)) + 
    geom_boxplot() + ggtitle("Alcohol in low quality and high quality wines") + 
    xlab('Wine Quality') + ylab("Alcohol (%)") + 
    theme(legend.title=element_blank())

plot of chunk unnamed-chunk-25

High percentage of alcohol does not necessarily guarantee high wine quality. Some low quality wines have alcohol strengths of more than 12%.

However, the most important ingredients which determine wine quality (beside sulphates) alcohol, volatile acity and citric acid. The following graph summarize this findings :

library(gridExtra)
levels(data$qual_cat) <- c("High Quality Wines", "Low Quality Wines")

g1 <- ggplot(data, aes(alcohol, volatile.acidity)) + ggtitle("Alcohol vs. Volatile Acidity in High- and Low-Quality Wines") +
ylab("Volatile Acidity (g / dm^3)") + xlab("Alcohol (% volume)") +
geom_point(aes(color = qual_cat), size=2, alpha=1/2) +
geom_smooth(method="lm") + 
facet_grid(.~qual_cat) +
theme_bw()+ theme(legend.position = "none")

g2 <- ggplot(data, aes(alcohol, citric.acid)) + ggtitle("Alcohol vs. Citric Acid in High- and Low-Quality Wines") +
ylab("Citric Acid (g / dm^3)") + xlab("Alcohol (% volume)") +
geom_point(aes(color = qual_cat), size=2, alpha=1/2) +
geom_smooth(method="lm") + 
facet_grid(.~qual_cat) +
theme_bw()+ theme(legend.position = "none")

grid.arrange(g1, g2, ncol=1)

plot of chunk unnamed-chunk-26

We can see from the graphs that the relationship between alcohol and volatile acitiy as well as between alcohol and citric acid differs between high and low quality wines.

These graphs provide useful information for wine-grower:

For example: we know that high quality wines usually contain more alcohol. A wine maker who therefore wants to increase the amount of alcohol, should also control the amount of volatile acidity as high-quality wines are characterized by a negative realtionship between both substances.

The opposite is true for citric acid: while increasing the amount of alcohol, the wine maker should try to increase the citric acid as high-quality wines are characterized by a positive relationship between alcohol and citric acid.

4 Reflection

The analysis identified the key factor which influences the quality of red wines: alcohol, citric acid, volatile acidiyt, and sulphates. Other incredients like sugar or chloride only have a minor influence on wine quality.

There are a few limitations to these interpretations: First of all, one should keep in mind that the results are only valid for this specific sort of wine and not for red wines in general. Secondly, although the dataset consists of more than 5000 wines, the evaluated qualities are mostly in the middle, between 5 and 7 on a scale of 10. It would be interesting, for example what characteristics extraordinary good (or bad) wines would have.

Throughout the analysis I raised the question how accurate and objective the wine experts validations are. Did all wine experts come to the same test result? Unfortunately the dataset does not contain a variable which identifies the test person. A possible test design for future studies could be to investigate if clusters of wine experts exist which evaluate in a similar way. Those clusters could be described by socio- or psychodemographic variables. E.g.: do younger wine experts evaluate different compared to older experts?

5 References

  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In: Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  2. Interpretation of the size of a correlation (2015). In: Wikipedia. Retrieved: June 24, 2015, from Wikipedia.

  3. R Data Analysis Examples: Logit Regression (n.d.). In: Institute for Digital Research and Education (idre) Retrieved: June 24, 2015, from www.ats.ucla.edu.

  4. A. Coghlan. Using R for Multivariate Analysis (n.d.). In: Little Book of R for Multivariate Analysis Retrieved: August 5, 2015, from www.readthedocs.com.


Appendix

Variable Description

Variable Description
fixed acidity most acids involved with wine or fixed or nonvolatile (do not evaporate readily)
volatile acidity the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
chlorides the amount of salt in the wine
free sulfur dioxide the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density the density of water is close to that of water depending on the percent alcohol and sugar content
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant
alcohol the percent alcohol content of the wine
quality Ouput Variable (score between 0 and 10)

Multiple Regression

model1<-lm(quality ~ 
             fixed.acidity +
             volatile.acidity +
             citric.acid +
             residual.sugar +
             chlorides +
             free.sulfur.dioxide +
             total.sulfur.dioxide +
             density  +
             pH +
             sulphates +
             alcohol, data=data )
summary(model1)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.689 -0.366 -0.047  0.452  2.025 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.20e+01   2.12e+01    1.04    0.300    
## fixed.acidity         2.50e-02   2.59e-02    0.96    0.336    
## volatile.acidity     -1.08e+00   1.21e-01   -8.95  < 2e-16 ***
## citric.acid          -1.83e-01   1.47e-01   -1.24    0.215    
## residual.sugar        1.63e-02   1.50e-02    1.09    0.276    
## chlorides            -1.87e+00   4.19e-01   -4.47  8.4e-06 ***
## free.sulfur.dioxide   4.36e-03   2.17e-03    2.01    0.045 *  
## total.sulfur.dioxide -3.27e-03   7.29e-04   -4.48  8.0e-06 ***
## density              -1.79e+01   2.16e+01   -0.83    0.409    
## pH                   -4.14e-01   1.92e-01   -2.16    0.031 *  
## sulphates             9.16e-01   1.14e-01    8.01  2.1e-15 ***
## alcohol               2.76e-01   2.65e-02   10.43  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.356 
## F-statistic: 81.3 on 11 and 1587 DF,  p-value: <2e-16
model2 = update(model1, .~.-citric.acid, .~.-residual.sugar)
summary(model2)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6896 -0.3697 -0.0464  0.4563  2.0247 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.22e+01   2.12e+01    1.05    0.294    
## fixed.acidity         1.43e-02   2.45e-02    0.58    0.560    
## volatile.acidity     -1.00e+00   1.02e-01   -9.81  < 2e-16 ***
## residual.sugar        1.53e-02   1.50e-02    1.02    0.306    
## chlorides            -2.01e+00   4.05e-01   -4.97  7.3e-07 ***
## free.sulfur.dioxide   4.80e-03   2.14e-03    2.24    0.025 *  
## total.sulfur.dioxide -3.50e-03   7.03e-04   -4.99  6.8e-07 ***
## density              -1.81e+01   2.16e+01   -0.84    0.403    
## pH                   -4.06e-01   1.92e-01   -2.12    0.034 *  
## sulphates             9.13e-01   1.14e-01    7.98  2.7e-15 ***
## alcohol               2.71e-01   2.62e-02   10.36  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1588 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.356 
## F-statistic: 89.3 on 10 and 1588 DF,  p-value: <2e-16
model3 = update(model2, .~.-density)
summary(model3)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6724 -0.3681 -0.0479  0.4663  2.0388 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.50087    0.61152    7.36  2.9e-13 ***
## fixed.acidity        -0.00286    0.01342   -0.21   0.8310    
## volatile.acidity     -1.01571    0.10118  -10.04  < 2e-16 ***
## residual.sugar        0.00785    0.01202    0.65   0.5137    
## chlorides            -2.04511    0.40246   -5.08  4.2e-07 ***
## free.sulfur.dioxide   0.00496    0.00213    2.32   0.0203 *  
## total.sulfur.dioxide -0.00355    0.00070   -5.07  4.3e-07 ***
## pH                   -0.49727    0.15704   -3.17   0.0016 ** 
## sulphates             0.88880    0.11073    8.03  1.9e-15 ***
## alcohol               0.28802    0.01692   17.02  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1589 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.356 
## F-statistic: 99.2 on 9 and 1589 DF,  p-value: <2e-16
# Stepwise:
step(model1, direction="backward")
## Start:  AIC=-1375
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
##                        Df Sum of Sq RSS   AIC
## - density               1       0.3 667 -1377
## - fixed.acidity         1       0.4 667 -1377
## - residual.sugar        1       0.5 667 -1376
## - citric.acid           1       0.6 667 -1376
## <none>                              666 -1375
## - free.sulfur.dioxide   1       1.7 668 -1373
## - pH                    1       2.0 668 -1373
## - chlorides             1       8.4 675 -1357
## - total.sulfur.dioxide  1       8.4 675 -1357
## - sulphates             1      27.0 693 -1314
## - volatile.acidity      1      33.6 700 -1299
## - alcohol               1      45.7 712 -1271
## 
## Step:  AIC=-1377
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol
## 
##                        Df Sum of Sq RSS   AIC
## - fixed.acidity         1       0.1 667 -1379
## - residual.sugar        1       0.2 667 -1378
## - citric.acid           1       0.7 667 -1377
## <none>                              667 -1377
## - free.sulfur.dioxide   1       1.8 669 -1374
## - pH                    1       4.3 671 -1368
## - total.sulfur.dioxide  1       8.7 675 -1358
## - chlorides             1       8.8 675 -1358
## - sulphates             1      27.3 694 -1315
## - volatile.acidity      1      35.0 702 -1297
## - alcohol               1     119.7 786 -1115
## 
## Step:  AIC=-1379
## quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + 
##     alcohol
## 
##                        Df Sum of Sq RSS   AIC
## - residual.sugar        1       0.3 667 -1380
## - citric.acid           1       0.6 667 -1379
## <none>                              667 -1379
## - free.sulfur.dioxide   1       1.9 669 -1376
## - pH                    1       7.1 674 -1364
## - chlorides             1       9.9 677 -1357
## - total.sulfur.dioxide  1      10.0 677 -1357
## - sulphates             1      27.7 694 -1316
## - volatile.acidity      1      36.2 703 -1296
## - alcohol               1     120.6 787 -1115
## 
## Step:  AIC=-1380
## quality ~ volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq RSS   AIC
## - citric.acid           1       0.5 668 -1381
## <none>                              667 -1380
## - free.sulfur.dioxide   1       2.1 669 -1377
## - pH                    1       7.1 674 -1365
## - total.sulfur.dioxide  1       9.8 677 -1359
## - chlorides             1       9.8 677 -1359
## - sulphates             1      27.4 695 -1317
## - volatile.acidity      1      36.0 703 -1298
## - alcohol               1     122.7 790 -1112
## 
## Step:  AIC=-1381
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq RSS   AIC
## <none>                              668 -1381
## - free.sulfur.dioxide   1       2.4 670 -1377
## - pH                    1       7.1 675 -1366
## - total.sulfur.dioxide  1      10.8 678 -1357
## - chlorides             1      10.8 678 -1357
## - sulphates             1      27.1 695 -1319
## - volatile.acidity      1      42.3 710 -1285
## - alcohol               1     124.5 792 -1109
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = data)
## 
## Coefficients:
##          (Intercept)      volatile.acidity             chlorides  
##              4.43010              -1.01275              -2.01781  
##  free.sulfur.dioxide  total.sulfur.dioxide                    pH  
##              0.00508              -0.00348              -0.48266  
##            sulphates               alcohol  
##              0.88267               0.28930
# Regression Diagnostics:
par(mfrow=c(2,2))                    # visualize four graphs at once
plot(model3)

plot of chunk unnamed-chunk-27